2025-05-01-15-00
Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models
Abstract
arXiv:2504.21277v1 Announce Type: new Abstract: The integration of reinforcement learning (RL) into the reasoning capabilities of Multimodal Large Language Models (MLLMs) has rapidly emerged as a transformative research direction. While MLLMs significantly extend Large Language Models (LLMs) to handle diverse modalities such as vision, audio, and video, enabling robust reasoning across multimodal inputs remains a major challenge. This survey systematically reviews recent advances in RL-based reasoning for MLLMs, covering key algorithmic designs, reward mechanism innovations, and practical applications. We highlight two main RL paradigms--value-free and value-based methods--and analyze how RL enhances reasoning abilities by optimizing reasoning trajectories and aligning multimodal information. Furthermore, we provide an extensive overview of benchmark datasets, evaluation protocols, and existing limitations, and propose future research directions to address current bottlenecks such as sparse rewards, inefficient cross-modal reasoning, and real-world deployment constraints. Our goal is to offer a comprehensive and structured guide to researchers interested in advancing RL-based reasoning in the multimodal era.
Abstract (Chinese translation)
将强化学习(RL)融入多模态大语言模型(MLLMs)的推理能力,已迅速成为一个变革性的研究方向。尽管MLLMs显著扩展了大语言模型(LLMs)处理视觉、音频和视频等多种模态的能力,但实现跨模态输入的稳健推理仍面临重大挑战。本文系统综述了基于RL的MLLMs推理的最新进展,涵盖关键算法设计、奖励机制创新及实际应用。我们重点分析了两种主要RL范式——无价值函数与基于价值函数的方法,并阐释了RL如何通过优化推理轨迹和对齐多模态信息来增强推理能力。此外,我们全面梳理了基准数据集、评估协议及现有局限性,并针对稀疏奖励、低效跨模态推理和现实部署约束等当前瓶颈问题,提出了未来研究方向。本研究旨在为推进多模态时代基于RL的推理研究提供全面而结构化的指南。
On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks
Abstract
arXiv:2504.21074v1 Announce Type: new Abstract: Large language models (LLMs) have been shown to be valuable tools for tackling process mining tasks. Existing studies report on their capability to support various data-driven process analyses and even, to some extent, that they are able to reason about how processes work. This reasoning ability suggests that there is potential for LLMs to tackle semantics-aware process mining tasks, which are tasks that rely on an understanding of the meaning of activities and their relationships. Examples of these include process discovery, where the meaning of activities can indicate their dependency, whereas in anomaly detection the meaning can be used to recognize process behavior that is abnormal. In this paper, we systematically explore the capabilities of LLMs for such tasks. Unlike prior work, which largely evaluates LLMs in their default state, we investigate their utility through both in-context learning and supervised fine-tuning. Concretely, we define five process mining tasks requiring semantic understanding and provide extensive benchmarking datasets for evaluation. Our experiments reveal that while LLMs struggle with challenging process mining tasks when used out of the box or with minimal in-context examples, they achieve strong performance when fine-tuned for these tasks across a broad range of process types and industries.
Abstract (Chinese translation)
大型语言模型(LLMs)已被证明是解决流程挖掘任务的有力工具。现有研究证实其能够支持多种数据驱动的流程分析,甚至在一定程度上具备对流程运作原理的推理能力。这种推理能力表明LLMs具备处理语义感知流程挖掘任务的潜力,这类任务依赖于对活动含义及其关系的理解。例如在流程发现中,活动含义可指示其依赖关系;而在异常检测中,语义信息可用于识别异常流程行为。本文系统性地探索了LLMs在此类任务中的能力。与主要评估默认状态下LLMs的先前研究不同,我们通过上下文学习和监督微调两种方式考察其实用性。具体而言,我们定义了五项需要语义理解的流程挖掘任务,并提供大量基准数据集用于评估。实验表明:虽然LLMs在直接使用或仅提供少量上下文示例时难以应对具有挑战性的流程挖掘任务,但经过针对不同流程类型和行业的任务微调后,其表现显著提升。
Theoretical Foundations for Semantic Cognition in Artificial Intelligence
Abstract
arXiv:2504.21218v1 Announce Type: new Abstract: This monograph presents a modular cognitive architecture for artificial intelligence grounded in the formal modeling of belief as structured semantic state. Belief states are defined as dynamic ensembles of linguistic expressions embedded within a navigable manifold, where operators enable assimilation, abstraction, nullification, memory, and introspection. Drawing from philosophy, cognitive science, and neuroscience, we develop a layered framework that enables self-regulating epistemic agents capable of reflective, goal-directed thought. At the core of this framework is the epistemic vacuum: a class of semantically inert cognitive states that serves as the conceptual origin of belief space. From this foundation, the Null Tower arises as a generative structure recursively built through internal representational capacities. The theoretical constructs are designed to be implementable in both symbolic and neural systems, including large language models, hybrid agents, and adaptive memory architectures. This work offers a foundational substrate for constructing agents that reason, remember, and regulate their beliefs in structured, interpretable ways.
Abstract (Chinese translation)
本专著提出了一种基于信念作为结构化语义状态形式化建模的模块化人工智能认知架构。信念状态被定义为嵌入可导航流形中的语言表达动态集合,其中操作符支持同化、抽象、消解、记忆和内省等功能。通过整合哲学、认知科学与神经科学的研究成果,我们构建了一个分层框架,使具备自我调节能力的认知主体能够进行反思性和目标导向的思维。该框架的核心是认知真空——一类语义惰性的认知状态,作为信念空间的概念起源。在此基础上,零塔结构通过内部表征能力的递归构建而生成。这些理论构造设计适用于符号系统和神经系统实现,包括大语言模型、混合智能体和自适应记忆架构。本研究为构建具有结构化、可解释性的推理、记忆和信念调节能力的智能体提供了基础性框架。
Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index
Abstract
arXiv:2504.21282v1 Announce Type: new Abstract: Natural language (NL)-driven table discovery identifies relevant tables from large table repositories based on NL queries. While current deep-learning-based methods using the traditional dense vector search pipeline, i.e., representation-index-search, achieve remarkable accuracy, they face several limitations that impede further performance improvements: (i) the errors accumulated during the table representation and indexing phases affect the subsequent search accuracy; and (ii) insufficient query-table interaction hinders effective semantic alignment, impeding accuracy improvements. In this paper, we propose a novel framework Birdie, using a differentiable search index. It unifies the indexing and search into a single encoder-decoder language model, thus getting rid of error accumulations. Birdie first assigns each table a prefix-aware identifier and leverages a large language model-based query generator to create synthetic queries for each table. It then encodes the mapping between synthetic queries/tables and their corresponding table identifiers into the parameters of an encoder-decoder language model, enabling deep query-table interactions. During search, the trained model directly generates table identifiers for a given query. To accommodate the continual indexing of dynamic tables, we introduce an index update strategy via parameter isolation, which mitigates the issue of catastrophic forgetting. Extensive experiments demonstrate that Birdie outperforms state-of-the-art dense methods by 16.8% in accuracy, and reduces forgetting by over 90% compared to other continual learning approaches.
Abstract (Chinese translation)
基于自然语言(NL)驱动的表格发现技术通过NL查询从大规模表格库中识别相关表格。尽管当前基于深度学习的传统稠密向量检索流程(即表示-索引-搜索)方法取得了显著精度,但仍存在限制性能进一步提升的若干问题:(i)表格表示和索引阶段积累的误差会影响后续搜索精度;(ii)查询-表格交互不足阻碍了有效的语义对齐,制约精度提升。本文提出新型框架Birdie,采用可微分搜索索引技术,将索引与搜索统一整合至单个编码器-解码器语言模型中,从而消除误差累积。Birdie首先为每个表格分配前缀感知标识符,并利用基于大语言模型的查询生成器为每个表格创建合成查询;随后将合成查询/表格与其对应表格标识符的映射关系编码至编码器-解码器语言模型的参数中,实现深度查询-表格交互。搜索阶段,训练后的模型直接为给定查询生成表格标识符。为适应动态表格的持续索引需求,我们通过参数隔离引入索引更新策略,显著缓解灾难性遗忘问题。大量实验表明,Birdie在准确率上超越最先进稠密方法16.8%,相比其他持续学习方法减少90%以上的遗忘率。
Phi-4-reasoning Technical Report
Abstract
arXiv:2504.21318v1 Announce Type: new Abstract: We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts (selected for the right level of complexity and diversity) and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as the DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.
Abstract (Chinese translation)
我们推出Phi-4-reasoning——一个140亿参数的推理模型,该模型在复杂推理任务中表现出色。该模型通过对Phi-4进行监督微调训练而成,训练数据包括精心筛选的具有适当复杂度与多样性的"可教学"提示集,以及使用o3-mini生成的推理演示。Phi-4-reasoning能生成充分利用推理时计算资源的详细推理链。我们还开发了增强版Phi-4-reasoning-plus,该变体通过短期基于结果的强化学习进一步提升了性能,可生成更长的推理轨迹。在各类推理任务中,这两个模型的性能显著优于DeepSeek-R1-Distill-Llama-70B等更大规模的开源权重模型,并接近完整版DeepSeek-R1模型的水平。我们的综合评估涵盖数学与科学推理、编程、算法问题求解、规划及空间理解等基准测试。值得注意的是,我们还观察到模型在通用基准测试上也获得了显著提升。本报告详细阐述了训练数据构成、训练方法及评估过程。研究表明,监督微调(SFT)中精细数据筛选的优势同样适用于推理语言模型,且可通过强化学习(RL)进一步放大。最后,我们的评估指出了当前推理模型性能与鲁棒性评估方法的改进空间。
Galvatron: An Automatic Distributed System for Efficient Foundation Model Training
Abstract
arXiv:2504.21411v1 Announce Type: new Abstract: Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy, incorporating data, tensor, pipeline, sharded data, and sequence parallelism, along with recomputation. The system's architecture includes a profiler for hardware and model analysis, a search engine for strategy optimization using decision trees and dynamic programming, and a runtime for executing these strategies efficiently. Benchmarking on various clusters demonstrates Galvatron's superior throughput compared to existing frameworks. This open-source system offers user-friendly interfaces and comprehensive documentation, making complex distributed training accessible and efficient. The source code of Galvatron is available at https://github.com/PKU-DAIR/Hetu-Galvatron.
Abstract (Chinese translation)
Galvatron是一个用于高效训练大规模基础模型的分布式系统。该系统通过自动识别最优混合并行策略(包含数据并行、张量并行、流水线并行、分片数据并行、序列并行以及重计算技术),克服了人工选择并行策略的复杂性。系统架构包含三个核心组件:用于硬件与模型分析的性能分析器、基于决策树与动态规划的策略优化搜索引擎,以及高效执行策略的运行时系统。在不同集群上的基准测试表明,Galvatron的吞吐量显著优于现有框架。这一开源系统提供用户友好接口与完整文档,使复杂分布式训练变得高效易用。Galvatron源代码已发布于https://github.com/PKU-DAIR/Hetu-Galvatron。
ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning
Abstract
arXiv:2504.21370v1 Announce Type: new Abstract: Reasoning models such as OpenAI o3 and DeepSeek-R1 have demonstrated strong performance on reasoning-intensive tasks through extended Chain-of-Thought (CoT) prompting. While longer reasoning traces can facilitate a more thorough exploration of solution paths for complex problems, researchers have observed that these models often "overthink", leading to inefficient inference. In this paper, we introduce ShorterBetter, a simple yet effective reinforcement learning method that enables reasoning language models to discover their own optimal CoT lengths without human intervention. By sampling multiple outputs per problem and defining the Sample Optimal Length (SOL) as the shortest correct response among all the outputs, our method dynamically guides the model toward optimal inference lengths. Applied to the DeepSeek-Distill-Qwen-1.5B model, ShorterBetter achieves up to an 80% reduction in output length on both in-domain and out-of-domain reasoning tasks while maintaining accuracy. Our analysis shows that overly long reasoning traces often reflect loss of reasoning direction, and thus suggests that the extended CoT produced by reasoning models is highly compressible.
Abstract (Chinese translation)
OpenAI o3和DeepSeek-R1等推理模型通过扩展的思维链(CoT)提示,在推理密集型任务中展现出强劲性能。虽然更长的推理轨迹有助于对复杂问题进行更彻底的求解路径探索,但研究者发现这些模型常出现"过度思考"现象,导致推理效率低下。本文提出ShorterBetter方法——一种简单而有效的强化学习策略,能使推理语言模型无需人工干预即可自主发现其最优CoT长度。该方法通过为每个问题采样多个输出,并将样本最优长度(SOL)定义为所有输出中最短的正确响应,动态引导模型趋向最优推理长度。在DeepSeek-Distill-Qwen-1.5B模型上的应用表明,ShorterBetter在保持准确率的同时,对领域内和领域外推理任务均实现了最高80%的输出长度缩减。分析显示,过长的推理轨迹往往反映推理方向的迷失,这表明推理模型生成的扩展CoT具有高度可压缩性。
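The Sample Optimal Length defined in the ShorterBetter abstract (the shortest correct response among the sampled outputs for one problem) can be illustrated with a minimal Python sketch. The `Sample` type, the `length` and `correct` fields, and the `length_gaps` helper are hypothetical names introduced here for illustration. The abstract does not specify the reward the method builds on top of SOL, so this sketch only computes SOL and each sample's distance from it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One sampled response for a problem (hypothetical container)."""
    text: str
    length: int    # token count of the response, assumed precomputed
    correct: bool  # whether the final answer was judged correct


def sample_optimal_length(samples: List[Sample]) -> Optional[int]:
    """Length of the shortest correct sample, or None if no sample is correct."""
    correct_lengths = [s.length for s in samples if s.correct]
    return min(correct_lengths) if correct_lengths else None


def length_gaps(samples: List[Sample]) -> Optional[List[int]]:
    """Signed distance of every sample's length from SOL (illustrative only)."""
    sol = sample_optimal_length(samples)
    if sol is None:
        return None
    return [s.length - sol for s in samples]
```

A training signal could, for instance, penalize large positive gaps, but how the paper actually shapes its reward is not stated in the abstract and is left open here.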
MF-LLM: Simulating Collective Decision Dynamics via a Mean-Field Large Language Model Framework
Abstract
arXiv:2504.21582v1 Announce Type: new Abstract: Simulating collective decision-making involves more than aggregating individual behaviors; it arises from dynamic interactions among individuals. While large language models (LLMs) show promise for social simulation, existing approaches often exhibit deviations from real-world data. To address this gap, we propose the Mean-Field LLM (MF-LLM) framework, which explicitly models the feedback loop between micro-level decisions and macro-level population. MF-LLM alternates between two models: a policy model that generates individual actions based on personal states and group-level information, and a mean field model that updates the population distribution from the latest individual decisions. Together, they produce rollouts that simulate the evolving trajectories of collective decision-making. To better match real-world data, we introduce IB-Tune, a fine-tuning method for LLMs grounded in the information bottleneck principle, which maximizes the relevance of population distributions to future actions while minimizing redundancy with historical data. We evaluate MF-LLM on a real-world social dataset, where it reduces KL divergence to human population distributions by 47 percent over non-mean-field baselines, and enables accurate trend forecasting and intervention planning. It generalizes across seven domains and four LLM backbones, providing a scalable foundation for high-fidelity social simulation.
Abstract (Chinese translation)
模拟集体决策不仅涉及个体行为的聚合,更源于个体间的动态交互。尽管大语言模型(LLMs)在社会模拟中展现出潜力,现有方法常与现实数据存在偏差。为弥合这一差距,我们提出平均场大语言模型(MF-LLM)框架,该框架显式建模微观决策与宏观群体间的反馈循环。MF-LLM交替运行两个模型:基于个体状态和群体信息生成个人行为的策略模型,以及根据最新个体决策更新群体分布的平均场模型。二者协同产生模拟集体决策演化轨迹的推演结果。为更好匹配现实数据,我们提出基于信息瓶颈原理的微调方法IB-Tune,其在最大化群体分布与未来行动相关性的同时,最小化与历史数据的冗余。我们在真实社会数据集上评估MF-LLM,其相较于非平均场基线方法将人类群体分布的KL散度降低47%,并能实现精准趋势预测与干预规划。该框架在七个领域和四种LLM骨干模型中均展现泛化能力,为高保真社会模拟提供了可扩展的基础。
AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization
Abstract
arXiv:2504.21659v1 Announce Type: new Abstract: Recently, long-thought reasoning models have achieved strong performance on complex reasoning tasks, but often incur substantial inference overhead, making efficiency a critical concern. Our empirical analysis reveals that the benefit of using Long-CoT varies across problems: while some problems require elaborate reasoning, others show no improvement, or even degraded accuracy. This motivates adaptive reasoning strategies that tailor reasoning depth to the input. However, prior work primarily reduces redundancy within long reasoning paths, limiting exploration of more efficient strategies beyond the Long-CoT paradigm. To address this, we propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models to enable diverse reasoning styles. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles (group-level), and prefer concise and correct reasoning within each style group (instance-level). Experiments demonstrate that our method significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models. Our code is coming soon at https://github.com/StarDewXXX/AdaR1
Abstract (Chinese translation)
近期,长链推理模型在复杂推理任务中展现出强大性能,但往往伴随显著的推理开销,使得效率成为关键问题。我们的实证分析表明,长链思维提示(Long-CoT)的效益因问题而异:部分问题需要精细推理,而另一些问题则未显现改进效果,甚至出现准确率下降。这促使我们研究根据输入动态调整推理深度的自适应策略。然而,现有工作主要集中于压缩长推理路径的冗余性,未能充分探索超越长链思维范式的高效策略。为此,我们提出一个新颖的两阶段自适应高效推理框架:首先通过融合长短链思维模型构建混合推理模型以实现多样化推理风格;其次采用双层偏好训练机制,指导模型在群体层面选择合适推理风格,并在风格组内实例层面优先选择简洁正确的推理路径。实验表明,本方法在保持性能的同时,较其他基线方法显著降低推理成本。值得注意的是,在五个数学数据集上,平均推理长度缩减超50%,凸显了自适应策略在优化大语言模型推理效率方面的潜力。代码即将发布于https://github.com/StarDewXXX/AdaR1。
Waking Up an AI: A Quantitative Framework for Prompt-Induced Phase Transition in Large Language Models
Abstract
arXiv:2504.21012v1 Announce Type: cross Abstract: What underlies intuitive human thinking? One approach to this question is to compare the cognitive dynamics of humans and large language models (LLMs). However, such a comparison requires a method to quantitatively analyze AI cognitive behavior under controlled conditions. While anecdotal observations suggest that certain prompts can dramatically change LLM behavior, these observations have remained largely qualitative. Here, we propose a two-part framework to investigate this phenomenon: a Transition-Inducing Prompt (TIP) that triggers a rapid shift in LLM responsiveness, and a Transition Quantifying Prompt (TQP) that evaluates this change using a separate LLM. Through controlled experiments, we examined how LLMs react to prompts embedding two semantically distant concepts (e.g., mathematical aperiodicity and traditional crafts)--either fused together or presented separately--by changing their linguistic quality and affective tone. Whereas humans tend to experience heightened engagement when such concepts are meaningfully blended producing a novel concept--a form of conceptual fusion--current LLMs showed no significant difference in responsiveness between semantically fused and non-fused prompts. This suggests that LLMs may not yet replicate the conceptual integration processes seen in human intuition. Our method enables fine-grained, reproducible measurement of cognitive responsiveness, and may help illuminate key differences in how intuition and conceptual leaps emerge in artificial versus human minds.
Abstract (Chinese translation)
人类直觉思维的基础是什么?一种研究途径是比较人类与大型语言模型(LLMs)的认知动态。然而,这种比较需要一种在受控条件下定量分析AI认知行为的方法。尽管轶事观察表明某些提示能显著改变LLM行为,但这些观察大多停留在定性层面。本研究提出一个双部分框架来探究该现象:一是通过"过渡诱导提示"(TIP)触发LLM响应能力的快速转变,二是采用"过渡量化提示"(TQP)通过独立LLM评估这种变化。通过控制实验,我们检测了LLMs对嵌入两个语义疏离概念(如数学非周期性与传统工艺)提示的反应——无论这些概念是融合呈现还是分离呈现——并分析其语言质量和情感色调的变化。研究发现:当人类遇到有意义融合产生新概念的情况(即概念融合形式)时,其参与度往往会提升;而当前LLMs对语义融合与非融合提示的响应能力未表现出显著差异。这表明LLMs可能尚未复现人类直觉中的概念整合过程。本方法实现了认知响应能力的细粒度、可重复测量,或有助于揭示人工与人类心智中直觉和概念跃迁产生的关键差异。
PICO: Secure Transformers via Robust Prompt Isolation and Cybersecurity Oversight
Abstract
arXiv:2504.21029v1 Announce Type: cross Abstract: We propose a robust transformer architecture designed to prevent prompt injection attacks and ensure secure, reliable response generation. Our PICO (Prompt Isolation and Cybersecurity Oversight) framework structurally separates trusted system instructions from untrusted user inputs through dual channels that are processed independently and merged only by a controlled, gated fusion mechanism. In addition, we integrate a specialized Security Expert Agent within a Mixture-of-Experts (MoE) framework and incorporate a Cybersecurity Knowledge Graph (CKG) to supply domain-specific reasoning. Our training design further ensures that the system prompt branch remains immutable while the rest of the network learns to handle adversarial inputs safely. This PICO framework is presented via a general mathematical formulation, then elaborated in terms of the specifics of transformer architecture, and fleshed out via hypothetical case studies including Policy Puppetry attacks. While the most effective implementation may involve training transformers in a PICO-based way from scratch, we also present a cost-effective fine-tuning approach.
Abstract (Chinese translation)
我们提出一种鲁棒的Transformer架构,旨在防范提示注入攻击并确保安全可靠的内容生成。通过PICO(提示隔离与网络安全监督)框架,采用双通道结构设计将可信系统指令与不可信用户输入进行物理隔离——这两个通道独立处理,仅通过受控的门控融合机制实现最终合并。该框架在混合专家系统(MoE)中集成了专业安全代理模块,并引入网络安全知识图谱(CKG)以提供领域特异性推理能力。我们的训练方案确保系统提示分支保持不可变性,同时网络其余部分学会安全处理对抗性输入。本文首先给出PICO框架的通用数学表述,继而详细阐述其在Transformer架构中的具体实现,最后通过包括"策略傀儡攻击"在内的假设案例进行验证。虽然最有效的实施方案是从头开始基于PICO方法训练Transformer,但我们也提出了一种经济高效的微调方案。
Selecting the Right LLM for eGov Explanations
Abstract
arXiv:2504.21032v1 Announce Type: cross Abstract: The perceived quality of the explanations accompanying e-government services is key to gaining trust in these institutions, consequently amplifying further usage of these services. Recent advances in generative AI, and concretely in Large Language Models (LLMs) allow the automation of such content articulations, eliciting explanations' interpretability and fidelity, and more generally, adapting content to various audiences. However, selecting the right LLM type for this has become a non-trivial task for e-government service providers. In this work, we adapted a previously developed scale to assist with this selection, providing a systematic approach for the comparative analysis of the perceived quality of explanations generated by various LLMs. We further demonstrated its applicability through the tax-return process, using it as an exemplar use case that could benefit from employing an LLM to generate explanations about tax refund decisions. This was attained through a user study with 128 survey respondents who were asked to rate different versions of LLM-generated explanations about tax refund decisions, providing a methodological basis for selecting the most appropriate LLM. Recognizing the practical challenges of conducting such a survey, we also began exploring the automation of this process by attempting to replicate human feedback using a selection of cutting-edge predictive techniques.
Abstract (Chinese translation)
电子政务服务所附解释内容的感知质量是获取公众信任的关键因素,这种信任将促进服务的进一步使用。生成式人工智能(尤其是大语言模型)的最新进展使得此类解释内容能够自动化生成,从而提升解释的可解释性与保真度,并实现面向不同受众的内容适配。然而,如何选择合适的大语言模型类型已成为电子政务服务提供商面临的重要课题。本研究基于既有量表进行改进,通过系统化方法对比分析不同大语言模型生成解释的感知质量差异。我们以税务申报流程作为示范用例,验证该量表的适用性——该场景可通过大语言模型生成税款退还决策的解释而获益。我们开展了包含128名调查对象的用户研究,要求受试者对不同版本的大语言模型生成解释进行评分,从而为模型选择提供方法论依据。鉴于实施此类调查存在实际困难,我们还尝试采用前沿预测技术模拟人类反馈,初步探索该流程的自动化实现路径。
UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
Abstract
arXiv:2504.21027v1 Announce Type: cross Abstract: The advent of Large Language Models (LLMs) holds promise for revolutionizing various fields traditionally dominated by human expertise. Urban planning, a professional discipline that fundamentally shapes our daily surroundings, is one such field heavily relying on multifaceted domain knowledge and experience of human experts. The extent to which LLMs can assist human practitioners in urban planning remains largely unexplored. In this paper, we introduce a comprehensive benchmark, UrbanPlanBench, tailored to evaluate the efficacy of LLMs in urban planning, which encompasses fundamental principles, professional knowledge, and management and regulations, aligning closely with the qualifications expected of human planners. Through extensive evaluation, we reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. For instance, we observe that 70% of LLMs achieve subpar performance in understanding planning regulations compared to other aspects. Besides the benchmark, we present the largest-ever supervised fine-tuning (SFT) dataset, UrbanPlanText, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks. Our findings demonstrate that fine-tuned models exhibit enhanced performance in memorization tests and comprehension of urban planning knowledge, while there exists significant room for improvement, particularly in tasks requiring domain-specific terminology and reasoning. By making our benchmark, dataset, and associated evaluation and fine-tuning toolsets publicly available at https://github.com/tsinghua-fib-lab/PlanBench, we aim to catalyze the integration of LLMs into practical urban planning, fostering a symbiotic collaboration between human expertise and machine intelligence.
Abstract (Chinese translation)
大型语言模型(LLM)的出现为传统由人类专业知识主导的各个领域带来了革命性变革的曙光。城市规划作为从根本上塑造我们日常环境的专业学科,正是这样一个高度依赖人类专家多领域知识和经验的领域。LLM能在多大程度上辅助城市规划从业者,目前仍属未知领域。本文提出了一个综合性基准测试UrbanPlanBench,专门用于评估LLM在城市规划中的效能,该基准涵盖基本原理、专业知识及管理与法规,与人类规划师应具备的资质紧密契合。通过广泛评估,我们发现LLM在规划知识获取方面存在显著不平衡性,即使最先进的模型也未能达到专业标准。例如,我们观察到70%的LLM在理解规划法规方面表现欠佳。除基准测试外,我们还构建了有史以来最大的监督微调(SFT)数据集UrbanPlanText,包含来自城市规划考试和教科书的30,000余条指令对。研究结果表明,经过微调的模型在记忆测试和城市规划知识理解方面表现更优,但在需要领域专业术语和推理能力的任务上仍有较大提升空间。我们已将基准测试、数据集及相关评估与微调工具集公开于https://github.com/tsinghua-fib-lab/PlanBench,旨在推动LLM与城市规划实践的融合,促进人类专业知识与机器智能的协同合作。
Semantic-Aware Contrastive Fine-Tuning: Boosting Multimodal Malware Classification with Discriminative Embeddings
Abstract
arXiv:2504.21028v1 Announce Type: cross Abstract: The rapid evolution of malware variants requires robust classification methods to enhance cybersecurity. While Large Language Models (LLMs) offer potential for generating malware descriptions to aid family classification, their utility is limited by semantic embedding overlaps and misalignment with binary behavioral features. We propose a contrastive fine-tuning (CFT) method that refines LLM embeddings via targeted selection of hard negative samples based on cosine similarity, enabling LLMs to distinguish between closely related malware families. Our approach combines high-similarity negatives to enhance discriminative power and mid-tier negatives to increase embedding diversity, optimizing both precision and generalization. Evaluated on the CIC-AndMal-2020 and BODMAS datasets, our refined embeddings are integrated into a multimodal classifier within a Model-Agnostic Meta-Learning (MAML) framework on a few-shot setting. Experiments demonstrate significant improvements: our method achieves 63.15% classification accuracy with as few as 20 samples on CIC-AndMal-2020, outperforming baselines by 11--21 percentage points and surpassing prior negative sampling strategies. Ablation studies confirm the superiority of similarity-based selection over random sampling, with gains of 10-23%. Additionally, fine-tuned LLMs generate attribute-aware descriptions that generalize to unseen variants, bridging textual and binary feature gaps. This work advances malware classification by enabling nuanced semantic distinctions and provides a scalable framework for adapting LLMs to cybersecurity challenges.
Abstract (Chinese translation)
恶意软件变种的快速演化需要鲁棒的分类方法来增强网络安全。尽管大语言模型(LLMs)具有生成恶意软件描述以辅助家族分类的潜力,但其效用受限于语义嵌入重叠及与二进制行为特征的错位。我们提出一种对比微调(CFT)方法,通过基于余弦相似度的困难负样本定向选择来优化LLM嵌入,使LLMs能够区分密切相关的恶意软件家族。该方法结合高相似度负样本以增强判别力,并采用中阶相似度负样本提升嵌入多样性,从而同时优化精度与泛化能力。在CIC-AndMal-2020和BODMAS数据集上的评估显示,改进后的嵌入被集成至模型无关元学习(MAML)框架下的多模态分类器中,采用小样本设置。实验表明显著提升:我们的方法在CIC-AndMal-2020上仅需20个样本即达到63.15%分类准确率,较基线方法提高11-21个百分点,并超越现有负采样策略。消融实验证实基于相似度的选择优于随机采样,增益达10-23%。此外,经微调的LLMs生成的属性感知描述可泛化至未见变种,弥合了文本与二进制特征间的鸿沟。本研究通过实现精细语义区分推进了恶意软件分类,并为LLMs适应网络安全挑战提供了可扩展框架。
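The negative-selection idea in the contrastive fine-tuning abstract (high-similarity negatives for discriminative power, mid-tier negatives for diversity, both ranked by cosine similarity) might look roughly like the Python sketch below. The function names, the `n_hard` and `n_mid` parameters, and the choice to take "mid-tier" as the middle of the remaining ranking are assumptions made for illustration. The abstract does not give the paper's thresholds or sampling procedure.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_negatives(
    anchor: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    n_hard: int = 1,
    n_mid: int = 1,
) -> Tuple[List[str], List[str]]:
    """Split candidate negatives (embeddings from other families) into two groups.

    hard: the n_hard candidates most similar to the anchor.
    mid:  n_mid candidates taken from the middle of the remaining ranking
          (an assumed definition of "mid-tier").
    """
    ranked = sorted(candidates, key=lambda c: cosine(anchor, c[1]), reverse=True)
    hard = ranked[:n_hard]
    rest = ranked[n_hard:]
    start = max(0, (len(rest) - n_mid) // 2)
    mid = rest[start:start + n_mid]
    return [name for name, _ in hard], [name for name, _ in mid]
```

In a contrastive objective, the anchor would be pulled toward samples of its own family and pushed away from both groups of negatives. That training step is outside the scope of this sketch.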
ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees
Abstract
arXiv:2504.21022v1 Announce Type: cross Abstract: Linear Temporal Logic (LTL) has become a prevalent specification language for robotic tasks. To mitigate the significant manual effort and expertise required to define LTL-encoded tasks, several methods have been proposed for translating Natural Language (NL) instructions into LTL formulas, which, however, lack correctness guarantees. To address this, we introduce a new NL-to-LTL translation method, called ConformalNL2LTL, that can achieve user-defined translation success rates over unseen NL commands. Our method constructs LTL formulas iteratively by addressing a sequence of open-vocabulary Question-Answering (QA) problems with LLMs. To enable uncertainty-aware translation, we leverage conformal prediction (CP), a distribution-free uncertainty quantification tool for black-box models. CP enables our method to assess the uncertainty in LLM-generated answers, allowing it to proceed with translation when sufficiently confident and request help otherwise. We provide both theoretical and empirical results demonstrating that ConformalNL2LTL achieves user-specified translation accuracy while minimizing help rates.
Abstract (Chinese translation)
线性时序逻辑(LTL)已成为机器人任务的主流规约语言。为减少定义LTL编码任务所需的大量人工操作与专业知识,现有研究提出了多种将自然语言(NL)指令转换为LTL公式的方法,但这些方法缺乏正确性保证。为此,我们提出了一种名为ConformalNL2LTL的新型NL-to-LTL翻译方法,能够对未见过的自然语言命令实现用户自定义的翻译成功率。该方法通过利用大语言模型(LLM)处理一系列开放词汇问答(QA)问题,迭代式构建LTL公式。为实现不确定性感知的翻译,我们采用无分布不确定性量化工具——保形预测(CP)来评估LLM生成答案的不确定性,仅在置信度充足时继续翻译,否则请求人工协助。理论与实证结果表明,ConformalNL2LTL在满足用户指定翻译精度的同时,能最小化求助率。
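The proceed-or-ask-for-help behavior described in the ConformalNL2LTL abstract can be illustrated with standard split conformal prediction. The Python sketch below is a generic version under stated assumptions: the nonconformity score is taken as one minus the model's probability for an answer, and the function names and the singleton-set decision rule are chosen here for illustration. The abstract does not confirm that the paper uses this exact score or rule.

```python
import math
from typing import Dict, List, Optional, Set, Tuple


def conformal_threshold(cal_scores: List[float], alpha: float) -> float:
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores holds 1 - p(correct answer) on held-out calibration questions.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested coverage.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(answer_probs: Dict[str, float], threshold: float) -> Set[str]:
    """All candidate answers whose nonconformity score is within the threshold."""
    return {a for a, p in answer_probs.items() if 1.0 - p <= threshold}


def decide(answer_probs: Dict[str, float], threshold: float) -> Tuple[str, Optional[str]]:
    """Proceed only when the prediction set is a single answer; otherwise ask for help."""
    answers = prediction_set(answer_probs, threshold)
    if len(answers) == 1:
        return ("proceed", next(iter(answers)))
    return ("ask_for_help", None)
```

With exchangeable calibration and test data, the prediction set contains the correct answer with probability at least 1 - alpha, which is what lets a user-specified success rate translate into a calibration threshold.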
Creating and Evaluating Code-Mixed Nepali-English and Telugu-English Datasets for Abusive Language Detection Using Traditional and Deep Learning Models
Abstract
arXiv:2504.21026v1 Announce Type: cross Abstract: With the growing presence of multilingual users on social media, detecting abusive language in code-mixed text has become increasingly challenging. Code-mixed communication, where users seamlessly switch between English and their native languages, poses difficulties for traditional abuse detection models, as offensive content may be context-dependent or obscured by linguistic blending. While abusive language detection has been extensively explored for high-resource languages like English and Hindi, low-resource languages such as Telugu and Nepali remain underrepresented, leaving gaps in effective moderation. In this study, we introduce a novel, manually annotated dataset of 2 thousand Telugu-English and 5 Nepali-English code-mixed comments, categorized as abusive and non-abusive, collected from various social media platforms. The dataset undergoes rigorous preprocessing before being evaluated across multiple Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs). We experimented with models including Logistic Regression, Random Forest, Support Vector Machines (SVM), Neural Networks (NN), LSTM, CNN, and LLMs, optimizing their performance through hyperparameter tuning and evaluating them using 10-fold cross-validation and statistical significance testing (t-test). Our findings provide key insights into the challenges of detecting abusive language in code-mixed settings and offer a comparative analysis of computational approaches. This study contributes to advancing NLP for low-resource languages by establishing benchmarks for abusive language detection in Telugu-English and Nepali-English code-mixed text. The dataset and insights can aid in the development of more robust moderation strategies for multilingual social media environments.
Abstract (Chinese translation)
随着社交媒体上多语言用户数量的增长,检测语码混合文本中的侮辱性语言变得日益困难。用户在英语和母语之间无缝切换的语码混合交流方式,给传统侮辱内容检测模型带来了挑战,因为冒犯性内容可能依赖于上下文或被语言混合所掩盖。尽管针对英语和印地语等高资源语言的侮辱性语言检测已有广泛研究,但泰卢固语和尼泊尔语等低资源语言仍存在研究空白,导致有效内容审核的不足。本研究引入了一个新颖的手工标注数据集,包含从多个社交媒体平台收集的2000条泰卢固语-英语和500条尼泊尔语-英语语码混合评论,按侮辱性和非侮辱性分类。数据集经过严格预处理后,在多种机器学习(ML)、深度学习(DL)和大语言模型(LLM)上进行评估。我们实验了包括逻辑回归、随机森林、支持向量机(SVM)、神经网络(NN)、LSTM、CNN和LLM在内的模型,通过超参数调优优化其性能,并使用10折交叉验证和统计显著性检验(t检验)进行评估。研究结果揭示了语码混合环境下检测侮辱性语言的关键挑战,并提供了不同计算方法的对比分析。本研究通过建立泰卢固语-英语和尼泊尔语-英语语码混合文本的侮辱性语言检测基准,推动了低资源语言自然语言处理的发展。该数据集和研究成果可为多语言社交媒体环境开发更强大的内容审核策略提供支持。
Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations
Abstract
arXiv:2504.21019v1 Announce Type: cross Abstract: The growing popularity of large language models has raised concerns regarding the potential to misuse AI-generated text (AIGT). It becomes increasingly critical to establish an excellent AIGT detection method with high generalization and robustness. However, existing methods either focus on model generalization or concentrate on robustness. A unified mechanism that simultaneously addresses the challenges of generalization and robustness is less explored. In this paper, we argue that robustness can be viewed as a specific form of domain shift, and empirically reveal an intrinsic mechanism for model generalization in the AIGT detection task. Then, we propose a novel AIGT detection method (DP-Net) via dynamic perturbations introduced by reinforcement learning with carefully designed rewards and actions. Experimentally, extensive results show that the proposed DP-Net significantly outperforms some state-of-the-art AIGT detection methods for generalization capacity in three cross-domain scenarios. Meanwhile, DP-Net achieves the best robustness under two text adversarial attacks. The code is publicly available at https://github.com/CAU-ISS-Lab/AIGT-Detection-Evade-Detection/tree/main/DP-Net.
Abstract (Chinese translation)
随着大型语言模型的日益普及,人工智能生成文本(AIGT)的潜在滥用风险引发广泛关注。建立兼具高泛化性和强鲁棒性的AIGT检测方法变得至关重要。然而,现有方法或侧重模型泛化性,或聚焦鲁棒性,对同时解决泛化与鲁棒性挑战的统一机制探索不足。本文提出鲁棒性可视为领域偏移的特殊形式,并通过实证揭示了AIGT检测任务中模型泛化的内在机制。基于此,我们提出一种新型动态扰动检测方法(DP-Net),通过强化学习框架结合精心设计的奖励函数与动作空间引入动态扰动。大量实验表明,在三种跨域场景下,DP-Net的泛化能力显著优于当前最先进的AIGT检测方法;同时在两种文本对抗攻击下均展现出最佳鲁棒性。代码已开源:https://github.com/CAU-ISS-Lab/AIGT-Detection-Evade-Detection/tree/main/DP-Net。